Predicting IPO Performance from Nearest Neighbors Using TF-IDF Weighted Word Count Vectors

Authors

  • Roi Ceren
  • Muthukumaran Chandrasekaran
Abstract

We introduce a novel approach to mining and leveraging stock-related data to predict how new stocks perform after their initial public offering (IPO), a traditionally difficult task because new stocks lack both public information and historical performance data. We collect a large corpus of articles for every existing stock between March 1, 2014 and March 1, 2015. We create weighted feature vectors by calculating the TF-IDF value of every word that appears in a document about a given stock. We then perform locality-sensitive hashing on these feature vectors and bucket similar stocks into a neighborhood. Using a variety of monotonically decreasing functions, we examine the relationship between the 30-day price fluctuations of each IPO's neighbors by propagating each neighbor's 30-day change to the IPO, weighted according to its distance. Locality-sensitive hashing uncovers interesting relationships between IPOs and nearby neighbors. We observe a strong correlation between neighbor stock distance and respective 30-day performance.
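The pipeline described in the abstract (TF-IDF vectors, locality-sensitive hashing into neighborhoods, distance-weighted propagation of 30-day changes) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the LSH family, the distance measure, or the decay functions, so the random-hyperplane hashing, the cosine distance, the exponential decay, and all function names here are assumptions chosen for the example.

```python
import math
import random
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """docs maps a stock name to its list of tokens.
    Returns (vocab, vectors), where vectors maps each name to a TF-IDF list."""
    vocab = sorted({w for tokens in docs.values() for w in tokens})
    n_docs = len(docs)
    # Document frequency: in how many stocks' documents each word appears.
    df = Counter(w for tokens in docs.values() for w in set(tokens))
    idf = {w: math.log(n_docs / df[w]) for w in vocab}
    vectors = {}
    for name, tokens in docs.items():
        tf = Counter(tokens)
        vectors[name] = [tf[w] / len(tokens) * idf[w] for w in vocab]
    return vocab, vectors


def make_planes(dim, n_bits, seed=0):
    """Random hyperplanes for sign-based LSH (an assumed LSH family)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vec, planes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(v * p for v, p in zip(vec, plane)) >= 0) for plane in planes
    )


def bucket(vectors, planes):
    """Group stocks that share an LSH signature into one neighborhood."""
    buckets = defaultdict(list)
    for name, vec in vectors.items():
        buckets[lsh_signature(vec, planes)].append(name)
    return buckets


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def predict_change(ipo_vec, neighbors, decay=lambda d: math.exp(-d)):
    """neighbors is a non-empty list of (vector, 30-day change) pairs.
    Each change is weighted by a monotonically decreasing function of
    the neighbor's distance to the IPO, then the weights are normalized."""
    weights = [decay(cosine_distance(ipo_vec, vec)) for vec, _ in neighbors]
    total = sum(weights)
    return sum(w * change for w, (_, change) in zip(weights, neighbors)) / total
```

Any other monotonically decreasing function can be passed as `decay`, for example `lambda d: 1.0 / (1.0 + d)`, to compare how quickly a neighbor's influence falls off with distance.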

Similar resources

Title Generation for Machine-Translated Documents

In this paper, we present and compare automatically generated titles for machine-translated documents using several different statistics-based methods. A Naïve Bayesian, a K-Nearest Neighbour, a TF-IDF and an iterative Expectation-Maximization method for title generation were applied to 1000 original English news documents and again to the same documents translated from English into Portuguese,...

LeadLag LDA: Estimating Topic Specific Leads and Lags of Information Outlets

Identifying which outlet in social media leads the rest in disseminating novel information on specific topics is an interesting challenge for information analysts and social scientists. In this work, we hypothesize that novel ideas are disseminated through the creation and propagation of new or newly emphasized key words, and therefore lead/lag of outlets can be estimated by tracking word usage...

Spoken Document Clustering Using Word Confusion Networks

In this paper, we propose a word confusion network (WCN) based approach to perform clustering of the spoken documents and analyze its ability to handle the influence of speech recognition errors. WCN compactly represents multiple confidence weighted recognition hypotheses. Thus it provides scope for improving the clustering accuracy as a result of the likely presence of the correct transcriptio...

Words are not Equal: Graded Weighting Model for Building Composite Document Vectors

Despite the success of distributional semantics, composing phrases from word vectors remains an important challenge. Several methods have been tried for benchmark tasks such as sentiment classification, including word vector averaging, matrix-vector approaches based on parsing, and on-the-fly learning of paragraph vectors. Most models usually omit stop words from the composition. Instead of suc...

VTEX System Description for the NLI 2013 Shared Task

This paper describes the system developed for the NLI 2013 Shared Task, requiring to identify a writer’s native language by some text written in English. I explore the given manually annotated data using word features such as the length, endings and character trigrams. Furthermore, I employ k-NN classification. Modified TFIDF is used to generate a stop-word list automatically. The distance betw...

Publication date: 2015